Download a pre-trained spaCy model to fine-tune.
In [1]:
!python -m spacy download en_core_web_sm
Import libraries we'll need.
In [2]:
from __future__ import unicode_literals, print_function
import boto3
import json
import numpy as np
import pandas as pd
import spacy
Bring in Verta's ModelDB client to organize our work, and log and version metadata.
In [3]:
from verta import Client
client = Client('https://app.verta.ai')
proj = client.set_project('Tweet Classification')
expt = client.set_experiment('SpaCy')
Download a dataset of English tweets from S3 for us to train with.
In [4]:
S3_BUCKET = "verta-starter"
S3_KEY = "english-tweets.csv"
FILENAME = S3_KEY
boto3.client('s3').download_file(S3_BUCKET, S3_KEY, FILENAME)
Then we'll load the data, shuffle it, and clean it.
In [5]:
import utils
data = pd.read_csv(FILENAME).sample(frac=1).reset_index(drop=True)
utils.clean_data(data)
data.head()
Out[5]:
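The utils module is a helper bundled with this tutorial, so its cleaning logic isn't shown above. Purely as a sketch of what such a step might look like (the 'text' column name and the specific rules are assumptions, not the tutorial's actual code), a tweet-cleaning helper often strips URLs and @mentions and drops rows left empty:
def clean_data(df):
    # illustrative only: normalize the assumed 'text' column in place
    df['text'] = (
        df['text']
        .astype(str)
        .str.replace(r'https?://\S+', '', regex=True)  # drop URLs
        .str.replace(r'@\w+', '', regex=True)          # drop @mentions
        .str.strip()
    )
    df.drop(df[df['text'] == ''].index, inplace=True)  # remove now-empty tweets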
We'll first capture metadata about our code, configuration, dataset, and environment using utilities from the verta library.
In [6]:
from verta.code import Notebook
from verta.configuration import Hyperparameters
from verta.dataset import S3
from verta.environment import Python
code_ver = Notebook() # Notebook & git environment
config_ver = Hyperparameters({'n_iter': 20})
dataset_ver = S3("s3://{}/{}".format(S3_BUCKET, S3_KEY))
env_ver = Python() # pip environment and Python version
Then, to log them, we'll use a ModelDB repository to prepare a commit.
In [7]:
repo = client.set_repository('Tweet Classification')
commit = repo.get_commit(branch='master')
Now we'll add these versioned components to the commit and save it to ModelDB.
In [8]:
commit.update("notebooks/tweet-analysis", code_ver)
commit.update("config/hyperparams", config_ver)
commit.update("data/tweets", dataset_ver)
commit.update("env/python", env_ver)
commit.save("Initial model")
commit
Out[8]:
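If we want to double-check what we just versioned, components can be read back from the commit by path; this is a quick sanity check rather than a required step:
commit.get("config/hyperparams")  # returns the Hyperparameters blob stored above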
We'll use the pre-trained spaCy model we downloaded earlier...
In [9]:
nlp = spacy.load('en_core_web_sm')
...and fine-tune it with our dataset.
In [10]:
import training
training.train(nlp, data, n_iter=20)
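The training module is also bundled with the tutorial, so its internals aren't shown here. For orientation only, here is a minimal sketch of the spaCy 2.x fine-tuning loop such a helper typically wraps; the binary 'POSITIVE' label and the 'text'/'sentiment' column names are assumptions, and the real helper may differ:
import random
from spacy.util import minibatch, compounding

def train(nlp, data, n_iter=20):
    # add (or fetch) a text-classification pipe on top of the pre-trained pipeline
    if 'textcat' not in nlp.pipe_names:
        textcat = nlp.create_pipe('textcat')
        nlp.add_pipe(textcat, last=True)
    else:
        textcat = nlp.get_pipe('textcat')
    textcat.add_label('POSITIVE')

    # assumed layout: a 'text' column and a boolean-ish 'sentiment' column
    train_data = [
        (row['text'], {'cats': {'POSITIVE': bool(row['sentiment'])}})
        for _, row in data.iterrows()
    ]

    # update only the textcat pipe; leave the rest of the pipeline frozen
    other_pipes = [p for p in nlp.pipe_names if p != 'textcat']
    with nlp.disable_pipes(*other_pipes):
        optimizer = nlp.begin_training()
        for _ in range(n_iter):
            random.shuffle(train_data)
            losses = {}
            for batch in minibatch(train_data, size=compounding(4.0, 32.0, 1.001)):
                texts, annotations = zip(*batch)
                nlp.update(texts, annotations, sgd=optimizer, drop=0.2, losses=losses)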
Now that our model is good to go, we'll log it to ModelDB so our progress is never lost.
Using Verta's ModelDB Client, we'll create an Experiment Run to encapsulate our work, and log our model as an artifact.
In [11]:
run = client.set_experiment_run()
run.log_model(nlp)
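We could also attach searchable metadata to the run at this point, for example mirroring the hyperparameters we versioned and recording an evaluation score (the accuracy below is a placeholder, not a measured result):
run.log_hyperparameters({'n_iter': 20})
run.log_metric('val_accuracy', 0.87)  # placeholder; compute on a held-out split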
And finally, we'll link the commit we created earlier to the Experiment Run to complete our logged model version.
In [12]:
run.log_commit(
    commit,
    {
        'notebook': "notebooks/tweet-analysis",
        'hyperparameters': "config/hyperparams",
        'training_data': "data/tweets",
        'python_env': "env/python",
    },
)
Now we've consolidated all the information we would need to reproduce this model later, or revisit the work we've done!
Proceed to the second notebook to see how problematic commits can be reverted.